Group 4 members:

Introduction

Unlike in the past, when job seekers relied on newspapers to find job opportunities, job seekers today use employment websites such as JobStreet, LinkedIn, Indeed and many others, thanks to advances in technology and social communication. With the number of job scams constantly increasing, the authenticity of job postings has become critical. According to Habiba et al. (2021), a job scam is a fake job advertisement that steals the personal and professional information of job seekers instead of offering them genuine jobs. Job scams often involve fake online job ads on social platforms and untrusted job portals that offer high-paying jobs. Victims may also receive unsolicited messages on platforms such as WhatsApp, Facebook and WeChat offering jobs that do not exist. For example, job scammers may ask victims to disclose personal or banking details, or to pay upfront fees to secure an interview or obtain more information about the fraudulent job. Given the growing concern about job scams, our aim is to raise awareness among job seekers during the job application process and to give them an early warning sign using Machine Learning (ML) and Natural Language Processing (NLP) approaches.

Objectives

Initial Questions

Data Cleaning and Pre-processing

The dataset used in this project is the Employment Scam Aegean Dataset (EMSCAD), retrieved from Kaggle. It contains 17,880 observations, of which 866 are fake, and 18 features. The data consist of a combination of numeric and text features. A brief definition of the variables is given below:

Variable             Description
job_id               ID of each job posting
title                Title of the position or job
location             Where the job is located
department           Department of the job offered
salary_range         Expected salary range
company_profile      Company information
description          Description of the position offered
requirements         Prerequisites to qualify for the job
benefits             Benefits provided by the job
telecommuting        Whether work from home or remote work is allowed
has_company_logo     Whether the post has a company logo
has_questions        Whether the post has any questions
employment_type      Full-time, part-time, contract, temporary and others
required_experience  Experience level, e.g. Entry level, Executive, Director…
required_education   Education level, e.g. High School, Bachelor, Master…
industry             Relevant industry
function             Job function
fraudulent           Target variable (0: Real, 1: Fake)
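
The counts quoted in the dataset description imply a strongly imbalanced target. As a quick worked check, using only the two figures stated above (17,880 postings, 866 of them fake):

```r
# Figures quoted in the dataset description
n_total <- 17880
n_fake  <- 866

fake_share   <- n_fake / n_total               # proportion of fraudulent postings
real_to_fake <- (n_total - n_fake) / n_fake    # genuine postings per fake posting

round(fake_share, 4)
round(real_to_fake, 1)
```

Fewer than 5% of the postings are fake, which is worth keeping in mind when the target variable is modelled later.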

Import libraries
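
The chunk that loads the packages is not reproduced here. Judging only from the functions called later in this section (sample_n, select and glimpse from dplyr, str_split_fixed from stringr, skim_without_charts from skimr), a minimal setup could look like the sketch below. The exact package list used by the authors is an assumption.

```r
# Packages inferred from the functions used in this section (an assumption)
pkgs <- c("dplyr", "stringr", "skimr")

# Report any package that is not installed instead of failing midway
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(dplyr)
  library(stringr)
  library(skimr)
}
```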

Load data

df <- read.csv("https://raw.githubusercontent.com/abbylmm/fake_job_posting/main/data/fake_job_postings.csv")
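
The variable documented as function shows up as function. in the R output. The most likely reason is that function is a reserved word in R and read.csv, with its default check.names = TRUE, passes the header through make.names. A small sketch of that behaviour on a made-up three-column CSV:

```r
# `function` is a reserved word, so make.names() appends a dot to it
make.names(c("job_id", "function", "fraudulent"))

# Reading a tiny CSV from text shows the same renaming
csv_text <- "job_id,function,fraudulent\n1,Marketing,0\n"
toy <- read.csv(text = csv_text)
names(toy)

# check.names = FALSE keeps the header verbatim; the column then has to be
# referred to with backticks or as a string, e.g. toy_raw[["function"]]
toy_raw <- read.csv(text = csv_text, check.names = FALSE)
names(toy_raw)
```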

Display a random sample of n rows

df_fake_job <- df
sample_n(df_fake_job, 3)
##   job_id                                       title           location
## 1  10767                     English Teacher Abroad  US, MN, Minnetonka
## 2  13625                          Front-end Engineer BR, SP, São Paulo
## 3  17650 Data Entry Clerk / Administrative Assistant US, DC, Washington
##       department salary_range
## 1                            
## 2                            
## 3 Administrative     21-63000
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   company_profile
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           We help teachers get safe &amp; secure jobs abroad :)
## 2 Nubank is an early stage, technology-driven financial services startup funded by Sequoia Capital and Kaszek Ventures. We are building a truly global and diverse team, with people who are in the top of their areas of expertise for every position we hire, to set the new standard in financial services in Brazil. We see a significant opportunity in the credit card market in Brazil as it is currently commoditized and extremely inefficient, and therefore our first product is a credit card controlled by a mobile app, that also provides our customers full control of their finances on their mobile phones. We are based in São Paulo, Brazil.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Play with kids, get paid for it Love travel? Jobs in Asia$1,500+ USD monthly ($200 Cost of living)Housing provided (Private/Furnished)Airfare ReimbursedExcellent for student loans/credit cardsGabriel Adkins : #URL_ed9094c60184b8a4975333957f05be37e69d3cdb68decc9dd9a4242733cfd7f7##URL_75db76d58f7994c7db24e8998c2fc953ab9a20ea9ac948b217693963f78d2e6b#12 month contract : Apply today 
## 2 What are some examples of problems a front-end engineer will solve?Shipping valuable features requires close coordination between devops, database, API, and frontend workstreams. We consistently work with new technologies, and thus value professionals who are open to learning new things, regardless of pre-existing comfort zones. Nubank Front-End Engineers might solve any of the following problems:Architect scalable, high performance single page applicationsCreate interactive visualizations for live streaming data setsImplement budgeting tools to help customers better understand their spendingCreate intelligent monitors for key customer experiences and risk-relevant eventsImplement interactive / parallax effects to simplify communication of financial conceptsProduce high fidelity PDF documents with CSS3 and PhantomJSAutomated unit tests and end to end tests with Casper or NightwatchIterate on, and implement Sketch designs
## 3                                                                                                                                                                                                                                                                                                     Experienced, reliable team members are needed for our Data Entry Clerk / Administrative Assistant needed! We are currently searching for candidates with previous experience and/or motivated quick learners. These positions require a friendly phone personality, great attention to detail and the ability to work quickly and efficiently. This is a customer contact position that requires patience, a great phone demeanor, excellent verbal and written communications, and reliable work attendance.Key Aspects of Position:Provide extraordinary service to our customers at all times.Work as part of a Customer Service team.Other duties as assigned.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     requirements
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not necessaryCanada/US passport holders only
## 2 Relevant frontend experiences:JavaScript / Coffeescript / ClojureScript, HTML, CSSBrowser-based single page applications: Angular, Backbone, Ember, React, Mithril, Reagent, HoplonModern front end workflow, including Bower, Grunt, Gulp, etc.Interactive data visualization (e.g., d3.js, crossfilter)Automated testing (e.g., Jasmine, Karma, Mocha)Memory management and performance tuningEnglish language skills are helpful and background on functional programming/functional-style Javascript is a plus.You will fit well ifYou thrive in dynamic, fast-paced, results oriented teamsYou are hungry and enjoy being constantly challenged to learn and do moreYou embrace conflict of ideas and like to question the status quoYou learn fast and easily adapt to changing situations and prioritiesYou believe in building great products and doing great workYou want to understand the big picture, to be held accountable and make a meaningful contribution with your workYou will have a meaningful chance to shape architecture, process, and culture while working with bleeding edge technologies. We believe in good team chemistry, enthusiasm for building things, and our surprising capacity to learn new things when we stay humble and open-minded.
## 3                                                                                                                                                                                                     6 months to a year experience working in a fast pace, back to back call handling in a call center environment.High comfort level with computer-based work. Google applications knowledge and Netsuite or similar CRM/Ticketing software a plus.Must be able to multitask between various web applications.Passionate about providing stellar service to customers.The ability to be as friendly and helpful at the end of an 8-hour shift as in the beginning of the shift.Ability to work at a fast pace while maintaining accuracy.Great attention to detail, and a high sense of urgency.Excellent written and verbal communication skills.Ability to work various shifts during a 24 hour period, as schedules may vary from week to week. Solid record of good attendance at prior employer's references.All applications must be received online. No walk-ins or phone calls accepted. Due to the volume of applicants, we are unable to accept phone or email inquiries on application status. Applicants must follow these requirements in order to be considered.
##                                                                                                                                                                                               benefits
## 1                                                                                                                                                                                  See job description
## 2                     Competitive compensation packageHealth, dental and life insuranceMeal allowance (â\200œvale refeiçãoâ\200\235)Flexibility to choose your own custom setup (computer, monitors, OS etc.)
## 3 Health, Dental, Life and AD&amp;D Insurance, Employee Wellness and 401k #URL_c801649eeb4007728c8f41b2d6629d92c2295ff77e1f2d401d7696ce3569db63# Time Off and Holidays with Generous Company Discounts
##   telecommuting has_company_logo has_questions employment_type
## 1             0                1             1        Contract
## 2             0                1             1       Full-time
## 3             0                0             0       Full-time
##   required_experience required_education             industry      function.
## 1                      Bachelor's Degree Education Management               
## 2      Not Applicable  Bachelor's Degree   Financial Services    Engineering
## 3         Entry level  Bachelor's Degree   Telecommunications Administrative
##   fraudulent
## 1          0
## 2          0
## 3          1

Summary of the data

summary(df_fake_job)
##      job_id         title             location          department       
##  Min.   :    1   Length:17880       Length:17880       Length:17880      
##  1st Qu.: 4471   Class :character   Class :character   Class :character  
##  Median : 8940   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 8940                                                           
##  3rd Qu.:13410                                                           
##  Max.   :17880                                                           
##  salary_range       company_profile    description        requirements      
##  Length:17880       Length:17880       Length:17880       Length:17880      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    benefits         telecommuting    has_company_logo has_questions   
##  Length:17880       Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :1.0000   Median :0.0000  
##                     Mean   :0.0429   Mean   :0.7953   Mean   :0.4917  
##                     3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  employment_type    required_experience required_education   industry        
##  Length:17880       Length:17880        Length:17880       Length:17880      
##  Class :character   Class :character    Class :character   Class :character  
##  Mode  :character   Mode  :character    Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##   function.           fraudulent     
##  Length:17880       Min.   :0.00000  
##  Class :character   1st Qu.:0.00000  
##  Mode  :character   Median :0.00000  
##                     Mean   :0.04843  
##                     3rd Qu.:0.00000  
##                     Max.   :1.00000

Check for missing values, including empty strings (the ‘empty’ column)

skim_without_charts(df_fake_job)
Data summary
Name df_fake_job
Number of rows 17880
Number of columns 18
_______________________
Column type frequency:
character 13
numeric 5
________________________
Group variables None

Variable type: character

skim_variable        n_missing  complete_rate  min    max  empty  n_unique  whitespace
title                        0              1    3    142      0     11231           0
location                     0              1    0    161    346      3106           0
department                   0              1    0    255  11547      1338           6
salary_range                 0              1    0     20  15012       875           0
company_profile              0              1    0   6230   3308      1710           0
description                  0              1    3  22722      0     14802           0
requirements                 0              1    0  10921   2694     11970           0
benefits                     2              1    0   4489   7206      6207           0
employment_type              0              1    0      9   3471         6           0
required_experience          0              1    0     16   7050         8           0
required_education           0              1    0     33   8105        14           0
industry                     0              1    0     36   4903       132           0
function.                    0              1    0     22   6455        38           0

Variable type: numeric

skim_variable     n_missing  complete_rate     mean       sd  p0      p25     p50       p75   p100
job_id                    0              1  8940.50  5161.66   1  4470.75  8940.5  13410.25  17880
telecommuting             0              1     0.04     0.20   0     0.00     0.0      0.00      1
has_company_logo          0              1     0.80     0.40   0     1.00     1.0      1.00      1
has_questions             0              1     0.49     0.50   0     0.00     0.0      1.00      1
fraudulent                0              1     0.05     0.21   0     0.00     0.0      0.00      1
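
The empty column reported by skimr counts zero-length strings, which is.na() does not treat as missing. A base-R sketch of the same distinction on a hypothetical toy data frame (the values are illustrative, not taken from the dataset):

```r
toy <- data.frame(
  title      = c("Marketing Intern", "Electrician", "Data Entry Clerk"),
  department = c("Marketing", "", ""),
  benefits   = c("", NA, "Health insurance"),
  stringsAsFactors = FALSE
)

# Empty strings per column; na.rm = TRUE because NA == "" evaluates to NA
empty_counts <- colSums(toy == "", na.rm = TRUE)

# True NA values per column, which is what is.na() detects
na_counts <- colSums(is.na(toy))

empty_counts
na_counts
```

This is why the character columns look complete to is.na() even though many of them are blank.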

Split location into country, state and city, and replace empty strings with NA

df_fake_job[c("country", "state", "city")] <- str_split_fixed(df_fake_job$location, ", ", 3)
df_fake_job[c("country", "state", "city")][df_fake_job[c("country", "state", "city")] == ""] <- NA
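
To make the effect of the two lines above concrete, the sketch below mimics a fixed three-way split in base R on a few made-up location strings and then applies the same empty-to-NA replacement. split_fixed_base is a hypothetical helper written for this illustration; it is not a stringr function and is only intended to behave like str_split_fixed for simple literal separators.

```r
# Hypothetical helper: split each string into exactly n pieces on a literal
# separator, padding with "" and leaving any remainder in the last piece
split_fixed_base <- function(x, sep, n) {
  out <- matrix("", nrow = length(x), ncol = n)
  for (i in seq_along(x)) {
    rest <- x[i]
    for (j in seq_len(n)) {
      pos <- regexpr(sep, rest, fixed = TRUE)
      if (j == n || pos == -1) {
        out[i, j] <- rest
        break
      }
      out[i, j] <- substr(rest, 1, pos - 1)
      rest <- substr(rest, pos + attr(pos, "match.length"), nchar(rest))
    }
  }
  out
}

# Made-up examples: a full location, a missing state, and an empty string
toy <- data.frame(location = c("US, NY, New York", "NZ, , Auckland", ""),
                  stringsAsFactors = FALSE)
cols <- c("country", "state", "city")

toy[cols] <- split_fixed_base(toy$location, ", ", 3)
toy[cols][toy[cols] == ""] <- NA   # same replacement idiom as above
toy
```

A location with a blank middle part keeps its country and city and gets NA only for the state, while a completely empty location becomes NA in all three columns.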

Split salary_range into min_salary and max_salary, and replace empty strings with NA

df_fake_job[c("min_salary", "max_salary")] <- str_split_fixed(df_fake_job$salary_range, "-", 2)
df_fake_job[c("min_salary", "max_salary")][df_fake_job[c("min_salary", "max_salary")] == ""] <- NA
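
Both new columns are still character vectors after the split. If numeric salaries are needed later, one possible follow-up is sketched below on made-up values. The conversion step is an assumption and not part of the original pipeline.

```r
# Made-up character values in the shape produced by the split above
min_salary <- c("20000", NA, "40000", "Dec")
max_salary <- c("28000", NA, NA, "25")

# as.numeric() turns non-numeric text into NA and warns; the warning is
# silenced here because that coercion is intended
min_num <- suppressWarnings(as.numeric(min_salary))
max_num <- suppressWarnings(as.numeric(max_salary))

data.frame(min_salary, max_salary, min_num, max_num)
```

Entries that are not plain numbers become NA after conversion, so the number of usable salary values can be lower than the count of non-empty strings.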

Drop location and salary_range

df_fake_job <- select(df_fake_job, -c(location, salary_range))

View the structure of the data

glimpse(df_fake_job)
## Rows: 17,880
## Columns: 21
## $ job_id              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,~
## $ title               <chr> "Marketing Intern", "Customer Service - Cloud Vide~
## $ department          <chr> "Marketing", "Success", "", "Sales", "", "", "ANDR~
## $ company_profile     <chr> "We're Food52, and we've created a groundbreaking ~
## $ description         <chr> "Food52, a fast-growing, James Beard Award-winning~
## $ requirements        <chr> "Experience with content management systems a majo~
## $ benefits            <chr> "", "What you will get from usThrough being part o~
## $ telecommuting       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ has_company_logo    <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,~
## $ has_questions       <int> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,~
## $ employment_type     <chr> "Other", "Full-time", "", "Full-time", "Full-time"~
## $ required_experience <chr> "Internship", "Not Applicable", "", "Mid-Senior le~
## $ required_education  <chr> "", "", "", "Bachelor's Degree", "Bachelor's Degre~
## $ industry            <chr> "", "Marketing and Advertising", "", "Computer Sof~
## $ function.           <chr> "Marketing", "Customer Service", "", "Sales", "Hea~
## $ fraudulent          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ country             <chr> "US", "NZ", "US", "US", "US", "US", "DE", "US", "U~
## $ state               <chr> "NY", NA, "IA", "DC", "FL", "MD", "BE", "CA", "FL"~
## $ city                <chr> "New York", "Auckland", "Wever", "Washington", "Fo~
## $ min_salary          <chr> NA, NA, NA, NA, NA, NA, "20000", NA, NA, NA, "1000~
## $ max_salary          <chr> NA, NA, NA, NA, NA, NA, "28000", NA, NA, NA, "1200~
class(df_fake_job)
## [1] "data.frame"

View column names

names(df_fake_job)
##  [1] "job_id"              "title"               "department"         
##  [4] "company_profile"     "description"         "requirements"       
##  [7] "benefits"            "telecommuting"       "has_company_logo"   
## [10] "has_questions"       "employment_type"     "required_experience"
## [13] "required_education"  "industry"            "function."          
## [16] "fraudulent"          "country"             "state"              
## [19] "city"                "min_salary"          "max_salary"

Check for duplicate IDs

table(duplicated(df_fake_job$job_id))
## 
## FALSE 
## 17880
# there are no duplicate IDs
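
The same check can be written more compactly with anyDuplicated, which returns 0 when every value is unique. A small sketch on a toy vector (the values are made up):

```r
ids <- c(1, 2, 3, 3, 5)          # toy IDs with one repeated value

first_dup  <- anyDuplicated(ids) # index of the first duplicate, 0 if none
dup_values <- ids[duplicated(ids)]  # the duplicated values themselves

first_dup
dup_values
anyDuplicated(c(1, 2, 3)) == 0   # TRUE when every ID is unique
```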

Count the missing values in each feature

colSums(is.na(df_fake_job))
##              job_id               title          department     company_profile 
##                   0                   0                   0                   0 
##         description        requirements            benefits       telecommuting 
##                   0                   0                   2                   0 
##    has_company_logo       has_questions     employment_type required_experience 
##                   0                   0                   0                   0 
##  required_education            industry           function.          fraudulent 
##                   0                   0                   0                   0 
##             country               state                city          min_salary 
##                 346                2580                2067               15012 
##          max_salary 
##               15013

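Only the ‘benefits’ column had missing values in the raw data (two of them). The NA counts for country, state, city, min_salary and max_salary come from the empty strings that were replaced with NA during the splits above. max_salary has one more NA than min_salary, which suggests that one salary_range value has no upper bound.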
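
Raw counts are easier to compare as percentages of the rows. A minimal sketch on a hypothetical toy data frame (column names borrowed from the dataset, values made up):

```r
toy <- data.frame(
  benefits   = c("401k", NA, "Health"),
  country    = c("US", "NZ", NA),
  min_salary = c(NA, NA, "20000"),
  stringsAsFactors = FALSE
)

# Percentage of missing values per column
na_share <- round(colMeans(is.na(toy)) * 100, 1)
na_share

# Number of rows that contain at least one NA
n_incomplete <- sum(!complete.cases(toy))
n_incomplete
```

Applied to the counts above, this kind of summary makes it clear that the salary columns are missing for the large majority of postings.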

List rows with missing values

missingdf <- df_fake_job[!complete.cases(df_fake_job), ]
sample_n(missingdf, 3)
##   job_id                   title                     department
## 1   2674 Guest Relations Officer Front Office & Guest Services 
## 2   4169     Production Engineer                               
## 3   9599             Electrician                    Maintenance
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 company_profile
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Elounda Beach Hotel &amp; VillasElounda 72053Crete
## 2 Valor Services provides Workforce Solutions that meet the needs of companies across the Private Sector, with a special focus on the Oil &amp; Gas Industry. Valor Services will be involved with you throughout every step of the hiring process and remain in contact with you all the way through the final step of signing of the employment contract with your new employer. Valor Services was founded with the vision of employing the unique skills, experiences, and qualities of Americaâ\200\231s finest veterans to provide Private Sector companies with precise and concerted value-added services â\200“ and Americaâ\200\231s finest Veterans with an optimized career opportunity.We are eager to get the word out to veterans that there are ample opportunities for employment in the private sector and that you are the ideal candidates to fill those positions. Valor Services Your Success is Our Mission. ™ 
## 3                            Niacet is a leading producer of organic salts, including propionates and acetates, serving the Food, Pharmaceutical and Technical industries. With two longstanding and fully automated manufacturing sites, located in Niagara Falls, NY USA, and Tiel, The Netherlands, Niacet offers world-class quality products to a global market. Our products fill vital needs in a broad range of applications that are essential to everyday life including food preservation, antibiotic formulation, dialysis treatment, energy production, and more.At Niacet all employees share in the growth and prosperity of the corporation. We want our employees to take pride in their personal and corporate accomplishments. Safe working conditions are achieved through continuous education of our  employees and improved facilities. We aim to provide job and financial security for all employees.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Perched on Crete&rsquo;s fabulous northeast coast, the Elounda Beach Hotel&amp; Villas brings classic island architecture to a luxury resort. Its superbly outfitted guestrooms and lush gardens overlooking &nbsp;Mirabello Bay make a visit here an exercise in relaxation and rejuvenation. Elounda Beach Hotel &amp; Villas provides guests the exceptional environment and the personalized attention that creates a singular experience. If you understand the value personalized attention and know how to treat even the most extraordinarily different experiences with the same rich level of customer service, you may be just the person we are looking for to work as a Team Member with Elounda Beach Hotels &amp; Villas .Because it&rsquo;s with Elounda Beach Hotel &amp; Villas where we promise our Guests a single rich, experience of hospitalityAs Guest Relations Officer, you will directly address the needs of VIP Guests and inform other Team Members of VIP needs in order to ensure an exceptional Guest experience. A Guest Relations Officer is responsible for managing the first impressions of our Guests and must perform their tasks to the highest standards:
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Our client, located in Oklahoma City, OK, is actively seeking an experienced Production Engineer that possesses strong project management skills. The ability to analyze data and solve problems is a must. The ideal candidate will also provide training to meet production goals.There are many opportunities for advancement in this growing company that offers strong compensation and benefits packages for qualified candidates who want to join the largest player in regional plays. Responsibilities:Perform engineering functions for production operations within a specified geographic area.Monitor production operations, costs, and profitability.Study areas for additional developmental drilling prospects.Design and implement facility and well workover plans and procedures.Generate and review AFE's for capital expenditures.Review all expenditures for properties within a specified area.Analyze production problems and direct corrective actions. Select equipment to be utilized.Assure compliance with governmental requirements and company policies.Provide training and resources to accomplish production goals.Provide expert testimony at regulatory hearings.
## 3 DEPARTMENT:      MaintenanceREPORTS TO:       Maintenance ManagerLOCATION:            Niagara Falls, NYPOSITION:              Electrician   About us: Niacet is a leading producer of organic salts, including propionates and acetates, serving the Food, Pharmaceutical and Technical industries. With two longstanding and fully automated manufacturing sites, located in Niagara Falls, NY USA, and Tiel, The Netherlands, Niacet offers world-class quality products to a global market.Our products fill vital needs in a broad range of applications that are essential to everyday life including food preservation, antibiotic formulation, dialysis treatment, energy production, and more. Electrician Position: Niagara Falls chemical manufacturer is looking for experienced electrician. General Job duties include, but are not limited to:--Maintenance of power distribution system, maintenance of instrumentation and control systems, electrical repairs to equipment, building service and repairs, installation of equipment in a chemical plant environment, housekeeping.--Must be able to read electrical diagrams, analyze problems and troubleshoot equipment operation; strong PLC and control system troubleshooting skills a plus.--May be required to move or lift up to 50lbs.--Good oral and written communication skills, experience with use of personal computers and prior chemical plant experience preferred.--Position requires support of plant maintenance needs on overtime and call-ins outside of regular hours and on weekends.--New York State Journeyman Industrial Electrician or Instrument Tech certification or equivalent experience required. We offer competitive compensation and one of the best benefit packages in the industry...
##                                                                                                                                                                                                                                                                                                                                                                                                                                                          requirements
## 1                                       To successfully fill this role, you should maintain the attitude, behaviors, skills, and values that follow:An ability to listen and respond to demanding Guest needs, Excellent leadership, interpersonal and communication skills, Accountable and resilient, Commitment to delivering a high levels of customer service, Ability to work under pressure, Flexibility to respond to a variety of different work situations.
## 2 Required:A minimum of 5 yrs' related experience or equivalent combination of education and experience.First aid and CPR certification, H2S training, and valid state operator's license. Qualifications:Bachelor’s degree in Engineering.Company Overview:Our client is a growing company that is a leader in the Bakken Shale and Oklahoma Shale plays. The company is looking for outstanding employees, and offers strong compensation and benefits packages.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
##                                                                                                  benefits
## 1 We look forward to explaining in detail the range of the benefits that you would expect from our hotel.
## 2                                                                                                        
## 3                                                                                                        
##   telecommuting has_company_logo has_questions employment_type
## 1             0                1             1                
## 2             0                1             0                
## 3             0                1             0       Full-time
##   required_experience required_education  industry     function. fraudulent
## 1                                                                         0
## 2                                                                         0
## 3    Mid-Senior level        Unspecified Chemicals Manufacturing          0
##   country state           city min_salary max_salary
## 1      GR     M Agios Nikolaos       <NA>       <NA>
## 2      US    OK  Oklahoma City       <NA>       <NA>
## 3      US    NY  Niagara Falls       <NA>       <NA>

Visualize missing rates for each feature

gg_miss_var(df_fake_job, show_pct = TRUE) + labs(y = "% Missing")
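
The missing rates shown in the plot can also be computed as plain numbers. The following is a minimal base R sketch on a small made-up data frame (`toy` and its values are illustrative, not the real `df_fake_job`). It counts both `NA` and empty strings as missing, which is an assumption about how blanks should be treated.

```r
# Toy data frame standing in for df_fake_job (values are made up)
toy <- data.frame(
  title      = c("Engineer", "Clerk", "Analyst", "Driver"),
  department = c("IT", "", NA, ""),
  benefits   = c(NA, "Health", "", "Dental"),
  stringsAsFactors = FALSE
)

# Percentage of missing entries per column, counting NA and "" as missing
pct_missing <- sapply(toy, function(x) mean(is.na(x) | x == "") * 100)

# Sort from most to least missing
pct_missing <- sort(pct_missing, decreasing = TRUE)
print(pct_missing)
```

Sorting the result makes it easy to see which features are candidates for dropping or for merging into a combined text field.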

Merge columns and create a new ‘full_text’ column

viz_df <- select(df_fake_job, -c(max_salary, min_salary, state, city))
# Text columns to combine, in the order they appear in full_text
text_cols <- c("title", "country", "department", "company_profile", 
               "description", "requirements", "benefits", "employment_type", 
               "required_experience", "required_education", "industry", 
               "function.")
# Replace NA with "" instead of calling na.omit(): na.omit() shortens a column, 
# and paste() would then recycle it and attach values to the wrong rows
viz_df$full_text <- do.call(
  paste, 
  unname(lapply(viz_df[text_cols], 
                function(x) ifelse(is.na(x), "", as.character(x)))))
viz_df[viz_df == ""] <- NA
# sample(viz_df, 3)
# write.csv(viz_df, "C:/Users/munmu/Documents/GitHub/fake_job_posting\\viz_df.csv", row.names = FALSE)
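
Row alignment matters when columns are concatenated: `paste()` is vectorised and recycles shorter inputs, so any step that drops elements from one column (for example `na.omit()`) shifts the remaining values onto other rows. A minimal sketch on toy data (the data frame `toy` and the helper `na_to_empty` are made up for illustration) that compares both behaviours:

```r
toy <- data.frame(
  title   = c("Electrician", "Sales Consultant", "Chef"),
  country = c("US", NA, "GR"),
  stringsAsFactors = FALSE
)

# Dropping NAs shortens the vector: 3 titles are pasted with 2 countries,
# and the countries are recycled across the rows
misaligned <- paste(toy$title, na.omit(toy$country))

# Replacing NA with "" keeps exactly one value per row
na_to_empty <- function(x) ifelse(is.na(x), "", x)
aligned <- trimws(paste(toy$title, na_to_empty(toy$country)))

print(misaligned)
print(aligned)
```

In the first result the second and third rows receive a country that belongs to a different posting, while the second result keeps each title with its own country.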

Visualize missing profile for each feature

plot_missing(viz_df)

Heatmap of missingness across the dataframe

vis_miss(viz_df)

Drop columns

model_df <- select(viz_df, 
                   -c(title, 
                      country, 
                      department, 
                      company_profile, 
                      description, 
                      requirements, 
                      benefits, 
                      employment_type, 
                      required_experience, 
                      required_education, 
                      industry, 
                      function.))
sample_n(model_df, 3)
##   job_id telecommuting has_company_logo has_questions fraudulent
## 1   3531             0                1             0          0
## 2  16893             0                1             0          0
## 3   9599             0                1             0          0
##   full_text
## 1 Resolution Specialist GR  Tidewater Finance Co. was established in 1992 for the initial purpose of purchasing, and servicing retail installment contracts. There are two divisions: Tidewater Credit Services, providing indirect consumer retail finance options and Tidewater Motor Credit, providing indirect consumer auto financing. We remain committed to offering a partnership with the dealers and consumers to create a WIN-WIN-WIN situation. Our success relies solely on the success of our dealers and our consumers.Full time positions include the following benefits:40 vacation hours after 6 months of employment, 80 vacation hours after 1 year of employment6 paid holidays as well as an anniversary holiday benefitPaid personal and sick leave after 90 days of employmentFull benefits to include health, dental, life and disability insuranceA 401k plan with a company match after 6 months of employment based upon a quarterly entry dateIncentive bonuses for individual and team goals (certain positions)Bilingual Spanish eligible for differential pay Tidewater Finance Company, located in Virginia Beach, VA has a full-time position available for a Resolution Specialist. We are a growing company and this position affords an opportunity to learn and contribute within our organization. 
Applicant must exhibit a majority of the following characteristics including, but not limited to:Professional demeanorAdaptability and flexibilityExcellent written and verbal communication skillsAbility to multi-task and excellent time management skillsDetail orientedAbility to work in a team and independently The duties for this position include, but are not limited to:Research and respond to all escalated consumer complaints received from multiple sources to include: phone calls, e-mail, web chat, letters, e-Oscar and managementLog, track, resolve and respond to all assigned inquiries and complaints while meeting all regulatory requirements, CMS and corporate guidelinesAct as a liaison between internal departments on data gathering and problem solving while investigating problems of an unusual nature in the area of responsibilityIdentify root cause issues to ensure proper solutions and communicate findings as neededThoroughly research issues and take appropriate action to resolve them within sufficient timeAssist with special projects as assigned or directed We offer a competitive salary based on experience and a comprehensive benefits package. If you are interested in working for a dynamic and collaborative financial services company, then Tidewater Finance Company is the place for you!Please submit your resume and salary requirements to Tidewater Finance Company, 6520 Indian River Road, Virginia Beach, VA 23464, Attn: Human Resources Department. If you choose to fax or email your resume, our fax number is (757) 424-9651 and our email address is #EMAIL_169ac3804e2da6e0514e5ef76c29f157f41d80451b486889d9aa#PHONE_4dbd33c1dede3cec472e02df8f201e27aa330a9a201578720111c840de9d8117##Tidewater Finance Company is an equal opportunity employer in all aspects of employment without regard to race, age, sex, marital status, religion, disability, military status or any other characteristic or status protected by law.  
Tidewater Finance Company includes Tidewater Motor Credit and Tidewater Credit Services. Applicant must exhibit a majority of the following characteristics including, but not limited to:Professional demeanorAdaptability and flexibilityExcellent written and verbal communication skillsAbility to multi-task and excellent time management skillsDetail orientedAbility to work in a team and independently Our company offers a competitive salary plus BONUSES as well as a comprehensive benefits package to our full-time employees including:40 vacation hours after 6 months of employment, 80 vacation hours after 1 year of employment6 paid holidays as well as an anniversary holiday benefitPaid personal and sick leave after 90 days of employmentHealth, dental, life, and disability insurance as well as AFLAC supplemental insuranceA 401K plan with a company match after six months of employment, however, we have quarterly enrollment periods. Full-time Entry level Unspecified Financial Services Legal
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Sales Consultant GB  LEI Home Enhancements, is an Ohio based company that has been installing windows, siding, doors and decks in homes throughout the Tri-state, Dayton, Indianapolis and Columbus for over seven years.With pride in our work, honesty and integrity in our professionalism and a companywide dedication to customer satisfaction, we offer a wide range of remodeling services to homeowners.Whether your project is large or small, we understand the trust and confidence each customer places in our skilled hands.  That's why we use only superior quality products and exceptional craftsmanship to achieve long-lasting beauty, performance and value for your home.  
We take every measure to carefully ensure our craftsman are properly trained in all phases of home improvement.  Likewise, our sales staff and customer service representatives draw upon their years of experience for quality installations.From day one, we will welcome any questions and concerns you may have during the renovation process.  It is our goal to provide you with beautiful and practical home improvements that will stand the test of time, along with the peace of mind that you have made an excellent selection for your home. We are one of the fastest growing Home Improvement companies in the area. LEI is looking for motivated sales professionals to start your career.We are hiring 10-12 Sales Representatives to staff our office for our expansion starting in November!RESPONSIBILITIES:Speak with potential customers about the benefits of our home improvement products (Windows, Siding, Doors)Pitch prequalified and preset leads directly to a homeowner who is interested in buying our productsMaintain professional relationships with customers and new potential customersManage and maintain a constant influx of leadsBe helpful with all client's needsDemonstrate sample products to show customers the benefits of our productsSell the #1 rated window and siding products in America to people who already have set appointments QUALIFICATIONS:1. Applicants have to work hard and stay positive2. Must have a minimum of High School Diploma/GED or equivalent3. Applicants must be willing to complete an extensive training class that involves both our marketing and sales approaches4. Comfortable conducting business in person5. Excellent at CLOSING deals6. Professional at all times, in the office and in front of clients7. Knowledge all Windows applications WHAT WE WILL OFFER:1. WE PROVIDE PAID TRAINING AND FULL SUPPORT AT ALL TIMES, refresher meeting are provided2. We provide all preset leads, you just close the deal3. All necessary training to make you a closer4. TRAINING SALARY.5. 
NO COLD CALLING once you graduate sales trainingCOMPENSATION:Average monthly income of a sales consultant that works for LEI is around $3000-$15,000.Full benefits providedPaid trainingVacation Pay Full-time Entry level High School or equivalent Consumer Goods Sales
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Electrician US Maintenance Niacet is a leading producer of organic salts, including propionates and acetates, serving the Food, Pharmaceutical and Technical industries. With two longstanding and fully automated manufacturing sites, located in Niagara Falls, NY USA, and Tiel, The Netherlands, Niacet offers world-class quality products to a global market. 
Our products fill vital needs in a broad range of applications that are essential to everyday life including food preservation, antibiotic formulation, dialysis treatment, energy production, and more.At Niacet all employees share in the growth and prosperity of the corporation. We want our employees to take pride in their personal and corporate accomplishments. Safe working conditions are achieved through continuous education of our  employees and improved facilities. We aim to provide job and financial security for all employees. DEPARTMENT:      MaintenanceREPORTS TO:       Maintenance ManagerLOCATION:            Niagara Falls, NYPOSITION:              Electrician   About us: Niacet is a leading producer of organic salts, including propionates and acetates, serving the Food, Pharmaceutical and Technical industries. With two longstanding and fully automated manufacturing sites, located in Niagara Falls, NY USA, and Tiel, The Netherlands, Niacet offers world-class quality products to a global market.Our products fill vital needs in a broad range of applications that are essential to everyday life including food preservation, antibiotic formulation, dialysis treatment, energy production, and more. Electrician Position: Niagara Falls chemical manufacturer is looking for experienced electrician. 
General Job duties include, but are not limited to:--Maintenance of power distribution system, maintenance of instrumentation and control systems, electrical repairs to equipment, building service and repairs, installation of equipment in a chemical plant environment, housekeeping.--Must be able to read electrical diagrams, analyze problems and troubleshoot equipment operation; strong PLC and control system troubleshooting skills a plus.--May be required to move or lift up to 50lbs.--Good oral and written communication skills, experience with use of personal computers and prior chemical plant experience preferred.--Position requires support of plant maintenance needs on overtime and call-ins outside of regular hours and on weekends.--New York State Journeyman Industrial Electrician or Instrument Tech certification or equivalent experience required. We offer competitive compensation and one of the best benefit packages in the industry...  See job description Full-time Mid-Senior level Unspecified Chemicals Manufacturing
# write.csv(model_df, "C:/Users/munmu/Documents/GitHub/fake_job_posting\\model_df.csv", row.names = FALSE)

Check NA or missing values

sum(is.na(model_df))
## [1] 0
sum(model_df == "")
## [1] 0

Visualize missing values

vis_miss(model_df)

vis_dat(model_df)

Exploratory Data Analysis (EDA)

Before building our models, we performed exploratory data analysis to understand the dataset.

Visualize fraud and real

viz_df2 <- viz_df
viz_df2$fraudulent[viz_df2$fraudulent == 1] <- "Fraud"
viz_df2$fraudulent[viz_df2$fraudulent == 0] <- "Non Fraud"
count <- table(viz_df2$fraudulent)
bar <- barplot(count, 
               main="Proportion of fraudulent job postings", 
               xlab="fraudulent", 
               ylab="count", 
               col=c(rgb(0.3,0.1,0.4,0.6), rgb(0.3,0.9,0.4,0.6)))
text(bar, count/2, labels = count)

The dataset contains 17,014 legitimate job postings and 866 fraudulent ones, a fraud rate of 4.84%.
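
The fraud rate follows directly from the two counts: 866 / (17,014 + 866) ≈ 4.84%. A minimal sketch that rebuilds a label vector with these counts (the vector is synthetic; only the counts come from the text) and recomputes the rate:

```r
# Synthetic label vector with the class counts reported above
fraudulent <- c(rep(0, 17014), rep(1, 866))

# Class counts and fraud rate in percent
counts <- table(fraudulent)
fraud_rate <- round(100 * mean(fraudulent == 1), 2)

print(counts)
print(fraud_rate)
```

A class split this uneven is worth keeping in mind when choosing evaluation metrics, since a model that labels every posting as legitimate would already be right about 95% of the time.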

Visualize country-wise job postings

temp <- na.omit(subset(viz_df, select = c(country))) %>% 
  group_by(country) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  slice(1:10)

par(mar=c(6,4,4,4))
barplot(height=temp$n, 
        main="Top 10 country-wise job postings", 
        ylab="count", 
        col=brewer.pal(10, "Set3"), 
        names.arg=c("United States",
                    "United Kingdom",
                    "Greece",
                    "Canada",
                    "Germany",
                    "New Zealand",
                    "India",
                    "Australia",
                    "Philippines",
                    "Netherlands"), 
        cex.names=0.7, 
        las=2)

The 10 countries with the most job postings are US, GB, GR, CA, DE, NZ, IN, AU, PH and NL. The United States lists 10,656 job postings, followed by the United Kingdom with 2,384 and Greece with 940.
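
The bar labels in the plot are typed by hand in the same order as the sorted counts, so they would be wrong if the ranking changed. A minimal sketch of deriving the labels from the country codes instead; the lookup vector `country_names` and the toy table `temp_toy` are illustrative names, and the three counts are the ones quoted above:

```r
# Lookup from two-letter country codes to display names
country_names <- c(US = "United States", GB = "United Kingdom", GR = "Greece",
                   CA = "Canada", DE = "Germany", NZ = "New Zealand",
                   IN = "India", AU = "Australia", PH = "Philippines",
                   NL = "Netherlands")

# Toy summary table, deliberately not in sorted order
temp_toy <- data.frame(country = c("GR", "US", "GB"),
                       n = c(940, 10656, 2384),
                       stringsAsFactors = FALSE)
temp_toy <- temp_toy[order(-temp_toy$n), ]

# Look up each code; fall back to the raw code when no name is known
labels <- unname(country_names[temp_toy$country])
labels[is.na(labels)] <- temp_toy$country[is.na(labels)]
print(labels)
```

The resulting vector could then be passed as `names.arg`, so the labels always follow the data.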

Visualize the industries

temp <- na.omit(subset(viz_df, select = c(industry))) %>% 
  group_by(industry) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  slice(1:10)

par(mar=c(10,4,4,4))
barplot(height=temp$n, 
        names=temp$industry, 
        main="Top 10 industries", 
        ylab="count", 
        col=brewer.pal(10, "Set3"), 
        cex.names=0.6, 
        las=2)

The three largest industries are IT-related: Information Technology and Services (1,734), Computer Software (1,376) and Internet (1,062).

Visualize the departments

temp <- na.omit(subset(viz_df, select = c(department))) %>% 
  group_by(department) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  slice(1:10)

par(mar=c(8,4,4,4))
barplot(height=temp$n, 
        names=temp$department, 
        main="Top 10 departments", 
        ylab="count", 
        col=brewer.pal(10, "Set3"), 
        cex.names=0.6, 
        las=2)

Top hiring departments are Sales (551), Engineering (487) and Marketing (401).
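
The country, industry and department summaries all follow the same pattern: drop missing values, count, sort and keep the top entries. A minimal base R sketch of that pattern as a reusable function; `top_counts` is a hypothetical helper, not part of the original code, and the toy data is made up:

```r
# Hypothetical helper: count the non-missing values of one column
# and return the n most frequent as a data frame
top_counts <- function(df, column, n = 10) {
  counts <- sort(table(df[[column]], useNA = "no"), decreasing = TRUE)
  counts <- counts[seq_len(min(n, length(counts)))]
  data.frame(value = names(counts), n = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy <- data.frame(
  department = c("Sales", "Engineering", "Sales", NA, "Marketing",
                 "Sales", "Engineering", NA),
  stringsAsFactors = FALSE
)

top_departments <- top_counts(toy, "department", n = 2)
print(top_departments)
```

The returned `value` and `n` columns map directly onto the `names` and `height` arguments of `barplot()`.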

Visualize the required experiences in the jobs

viz_df %>% 
  filter(!is.na(required_experience)) %>%  # exclude postings with no listed experience
  group_by(required_experience) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  slice(1:10) %>% 
  ggplot(aes(x=reorder(required_experience, -n), y = n)) + 
  geom_segment(aes(x=reorder(required_experience, -n), xend=reorder(required_experience, -n), y=0, yend=n), color="skyblue") + 
  geom_point(color="steelblue", size=2, alpha=1) + 
  theme_bw() +  # apply the complete theme first so the tweaks below are kept
  coord_flip() + 
  theme(panel.grid.major.y = element_blank(), 
        panel.border = element_blank(), 
        axis.ticks.y = element_blank()) + 
  labs(title = "Listed jobs with required experiences", 
       x = "Experience", 
       y = "Count") + 
  geom_text(aes(label=round(n,0)), vjust=-0.6)

Mid-Senior level positions are the most common, followed by Entry level and Associate.

Visualize the required education in the jobs

viz_df %>% 
  filter(!is.na(required_education)) %>%  # exclude postings with no listed education
  group_by(required_education) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  slice(1:10) %>% 
  ggplot(aes(x=reorder(required_education, -n), y = n)) + 
  geom_segment(aes(x=reorder(required_education, -n), xend=reorder(required_education, -n), y=0, yend=n), color="skyblue") + 
  geom_point(color="steelblue", size=2, alpha=1) + 
  theme_bw() +  # apply the complete theme first so the tweaks below are kept
  coord_flip() + 
  theme(panel.grid.major.y = element_blank(), 
        panel.border = element_blank(), 
        axis.ticks.y = element_blank()) + 
  labs(title = "Listed jobs with required education", 
       x = "Education", 
       y = "Count") + 
  geom_text(aes(label=round(n,0)), vjust=-0.6)

Most job ads that state an education requirement ask for at least a Bachelor’s degree.

Visualize fraudulent job postings based on employment types

viz_df2 <- viz_df
viz_df2$employment_type <- ifelse(is.na(viz_df2$employment_type), "Missing", viz_df2$employment_type)
df1 <- subset(viz_df2, select = c(employment_type, fraudulent)) %>% 
  group_by(employment_type, fraudulent) %>% 
  summarize(yes = sum(fraudulent==1), .groups = 'drop') %>% 
  filter(fraudulent==1)
df2 <- subset(viz_df2, select = c(employment_type, fraudulent)) %>% 
  group_by(employment_type, fraudulent) %>% 
  summarize(no = sum(fraudulent==0), .groups = 'drop') %>% 
  filter(fraudulent==0)
df_new <- merge(df1, df2, by = c("employment_type")) %>% 
  group_by(employment_type) %>% 
  summarize(pct_fraud = round(yes/(yes+no), digits=3), 
            pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>% 
  mutate(employment_type = factor(employment_type, 
                                  levels = c('Part-time',
                                             'Missing',
                                             'Other',
                                             'Full-time',
                                             'Contract',
                                             'Temporary')))
fig <- df_new %>% plot_ly(width = 700, height = 400)
fig <- fig %>% add_trace(x = ~employment_type, y = ~pct_non_fraud, type = 'bar', 
             text = ~paste0(pct_non_fraud*100,"%"), textposition = 'outside', name = 'pct_non_fraud', 
             marker = list(color = 'rgb(158,202,225)', 
                           line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% add_trace(x = ~employment_type, y = ~pct_fraud, type = 'bar', 
            text = ~paste0(pct_fraud*100,"%"), textposition = 'outside', name = 'pct_fraud', 
            marker = list(color = 'rgb(58,200,225)', 
                          line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% layout(title = "Employment types with % fraud and non-fraud",
         barmode = 'group',
         xaxis = list(title = "employment_type"),
         yaxis = list(title = "percentage"))
fig

The percentage of fraudulent job postings is the highest for part-time jobs, nearly 9%. Jobs without an employment type also have a high fraud rate, around 7%.
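
Because `fraudulent` is a 0/1 indicator, the fraud rate of a group is simply the mean of that column within the group. A minimal base R sketch of this shortcut on made-up data (the data frame `toy` and its values are illustrative and do not reproduce the percentages above); it is an alternative formulation, not the code used to produce the figure:

```r
toy <- data.frame(
  employment_type = c("Part-time", "Part-time", "Part-time", "Full-time",
                      "Full-time", "Full-time", "Full-time", NA, NA),
  fraudulent      = c(1, 1, 0, 0, 0, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Treat a missing employment type as its own category
toy$employment_type[is.na(toy$employment_type)] <- "Missing"

# The mean of a 0/1 indicator is the share of ones, i.e. the fraud rate
pct_fraud <- tapply(toy$fraudulent, toy$employment_type, mean)

rates <- data.frame(employment_type = names(pct_fraud),
                    pct_fraud = round(as.numeric(pct_fraud), 3),
                    pct_non_fraud = round(1 - as.numeric(pct_fraud), 3),
                    stringsAsFactors = FALSE)
print(rates)
```

This produces the same two columns as the merge-based approach in a single grouping step.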

Visualize fraudulent job postings based on required experiences

viz_df2 <- viz_df
viz_df2$required_experience <- ifelse(is.na(viz_df2$required_experience), "Not Applicable", viz_df2$required_experience)
df1 <- subset(viz_df2, select = c(required_experience, fraudulent)) %>% 
  group_by(required_experience, fraudulent) %>% 
  summarize(yes = sum(fraudulent==1), .groups = 'drop') %>% 
  filter(fraudulent==1)
df2 <- subset(viz_df2, select = c(required_experience, fraudulent)) %>% 
  group_by(required_experience, fraudulent) %>% 
  summarize(no = sum(fraudulent==0), .groups = 'drop') %>% 
  filter(fraudulent==0)
df_new <- merge(df1, df2, by = c("required_experience")) %>% 
  group_by(required_experience) %>% 
  summarize(pct_fraud = round(yes/(yes+no), digits=3), 
            pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>% 
  mutate(required_experience = factor(required_experience, 
                                      levels = c('Executive',
                                                 'Entry level',
                                                 'Not Applicable',
                                                 'Director',
                                                 'Mid-Senior level',
                                                 'Internship',
                                                 'Associate')))
fig <- df_new %>% plot_ly(width = 700, height = 400)
fig <- fig %>% add_trace(x = ~required_experience, y = ~pct_non_fraud, type = 'bar', 
             text = ~paste0(pct_non_fraud*100,"%"), textposition = 'outside', name = 'pct_non_fraud', 
             marker = list(color = 'rgb(158,202,225)', 
                           line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% add_trace(x = ~required_experience, y = ~pct_fraud, type = 'bar', 
            text = ~paste0(pct_fraud*100,"%"), textposition = 'outside', name = 'pct_fraud', 
            marker = list(color = 'rgb(58,200,225)', 
                          line = list(color = 'rgb(8,48,107)', width = 0.8)))
fig <- fig %>% layout(title = "Required experiences with % fraud and non-fraud",
         barmode = 'group',
         xaxis = list(title = "required_experience"),
         yaxis = list(title = "percentage"))
fig

Executive and entry-level job postings have the highest fraud rates, nearly 7%; entry-level jobs in particular require minimal qualifications and little experience.
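
The per-category fraud rates plotted above are the mean of the 0/1 fraudulent flag within each category. A minimal base R sketch of that idea follows; the toy data frame is hypothetical and is not the EMSCAD data.

```r
# Hypothetical toy data: 1 = fraudulent, 0 = real
toy <- data.frame(
  required_experience = c("Entry level", "Entry level", "Entry level", "Entry level",
                          "Director", "Director", "Director", "Director"),
  fraudulent = c(1, 0, 0, 0, 0, 0, 0, 0)
)

# The mean of a 0/1 flag within a group is that group's fraud rate
pct_fraud <- tapply(toy$fraudulent, toy$required_experience, mean)
pct_non_fraud <- 1 - pct_fraud

round(pct_fraud, 3)
```

This gives the same quantity as the merge of the two grouped counts, and it also keeps categories that contain no fraudulent postings at all.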

Visualize fraudulent job postings based on job functions

viz_df2 <- viz_df
viz_df2$fraudulent[viz_df2$fraudulent == 1] <- "Fraud"
viz_df2$fraudulent[viz_df2$fraudulent == 0] <- "Non Fraud"
temp <- na.omit(subset(viz_df2, select = c(function., fraudulent))) %>% 
  group_by(function., fraudulent) %>% 
  summarize(n = n(), .groups = 'drop') %>% 
  group_by(function.) %>% 
  summarize(pct_fraud = round(sum(n[fraudulent=="Fraud"]/sum(n)), digits=3), 
            pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>% 
  arrange(desc(pct_fraud)) %>% 
  slice(1:10) %>% 
  mutate(function. = factor(function., 
                            levels = c('Administrative',
                                       'Financial Analyst',
                                       'Accounting/Auditing',
                                       'Distribution',
                                       'Other',
                                       'Finance',
                                       'Engineering',
                                       'Business Development',
                                       'Advertising',
                                       'Customer Service')))
melted_temp <- melt(temp, id = "function.")
ggplot(melted_temp, aes(x = function., y = value, fill = variable)) + 
  geom_bar(position = "fill", 
           stat = "identity", 
           color = "black", 
           width = 0.8) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.6)) + 
  scale_y_continuous(labels = scales::percent) + 
  geom_text(aes(label = paste0(value*100,"%")), 
            position = position_stack(vjust = 0.6), size = 2) + 
  ggtitle("Job functions with % fraud and non-fraud") + 
  xlab("function") + 
  ylab("percentage")

The job function with the highest share of fraudulent postings is Administrative, at close to 19%, followed by Financial Analyst and Accounting/Auditing. Administrative jobs therefore look the most suspicious, possibly because it’s easy for scammers to disguise their scams as such roles.

Visualize fraudulent job postings based on required education

temp <- na.omit(subset(viz_df2, select = c(required_education, fraudulent))) %>% 
  group_by(required_education, fraudulent) %>% 
  summarize(n = n(), .groups = 'drop') %>% 
  group_by(required_education) %>% 
  summarize(pct_fraud = round(sum(n[fraudulent=="Fraud"]/sum(n)), digits=3), 
            pct_non_fraud = 1-pct_fraud, .groups = 'drop') %>% 
  arrange(desc(pct_fraud)) %>% 
  slice(1:10) %>% 
  mutate(required_education = factor(required_education, 
                                     levels = c("Some High School Coursework",
                                                "Certification",
                                                "High School or equivalent",
                                                "Master's Degree",
                                                "Professional",
                                                "Unspecified",
                                                "Doctorate",
                                                "Some College Coursework Completed",
                                                "Associate Degree",
                                                "Bachelor's Degree")))
melted_temp <- melt(temp, id = "required_education")
ggplot(melted_temp, aes(x = required_education, y = value, fill = variable)) + 
  geom_bar(position = "fill", 
           stat = "identity", 
           color = "black", 
           width = 0.8) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.6)) + 
  scale_y_continuous(labels = scales::percent) + 
  geom_text(aes(label = paste0(value*100,"%")), 
            position = position_stack(vjust = 0.6), size = 2) + 
  ggtitle("Required education with % fraud and non-fraud") + 
  xlab("required_education") + 
  ylab("percentage")

As many as 74% of the job postings that require only “Some High School Coursework” are fraudulent. Note that this is the fraud rate within that education level, not the share of all fake jobs.
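
The chart shows the fraud rate within each education level, which is a different quantity from the share of all fraudulent postings that fall into that level. The sketch below contrasts the two using `prop.table` on made-up counts (the numbers are hypothetical and chosen only for illustration).

```r
# Hypothetical counts: rows = required education, columns = class
counts <- matrix(c(20, 80,
                   900, 100),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(required_education = c("Some High School", "Bachelor's Degree"),
                                 class = c("Non Fraud", "Fraud")))

# Share of each education level that is fraudulent (row-wise proportions)
fraud_rate_by_level <- prop.table(counts, margin = 1)[, "Fraud"]

# Share of all fraudulent postings that fall in each education level (column-wise)
share_of_all_fraud <- prop.table(counts, margin = 2)[, "Fraud"]

round(rbind(fraud_rate_by_level, share_of_all_fraud), 3)
```

In this toy example the small category has the higher fraud rate, yet the larger category still contributes more fraudulent postings in absolute terms.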

Word Cloud

To compare fraudulent and real job postings, word clouds are used to show the most frequent keywords in the job titles. The postings are split into a fraudulent subset and a real subset, and a word cloud is plotted for each.
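
A word cloud is a picture of a term-frequency table. As a rough sketch of the counting step in base R (the titles are invented, and the cleaning is simpler than the `tm` pipeline used in this report):

```r
# Hypothetical job titles
titles <- c("Data Entry Clerk", "Home Based Data Entry", "Senior Software Engineer")

# Lowercase, strip everything except letters and spaces, then split on whitespace
clean <- gsub("[^a-z ]", "", tolower(titles))
tokens <- unlist(strsplit(clean, " +"))
tokens <- tokens[tokens != ""]

# Count occurrences and sort from most to least frequent
freq <- sort(table(tokens), decreasing = TRUE)
head(freq, 3)
```

The resulting counts play the same role as the `freq` column passed to `wordcloud()`.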

Word Cloud of fraudulent job postings

selected_df <- subset(viz_df, fraudulent == 1)

# Create a vector containing only the text
text <- selected_df$title

# Create a corpus
docs <- Corpus(VectorSource(text))

docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))

dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix), decreasing=TRUE)
df <- data.frame(word = names(words), freq=words)

wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

Many of the fraudulent job postings have common keywords in the job titles - “Data Entry”, “Administrative”, “Home Based”, “Earn Daily”.

Word Cloud of NON-fraudulent job postings

selected_df <- subset(viz_df, fraudulent == 0)

# Create a vector containing only the text
text <- selected_df$title

# Create a corpus
docs <- Corpus(VectorSource(text))

docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))

dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix), decreasing=TRUE)
df <- data.frame(word = names(words), freq=words)

wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

Many of the NON-fraudulent job postings have common keywords in the job titles - “Manager”, “Developer”, “Engineer”.

Modeling

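Before modeling, a final dataset is determined. This project uses the following features for the final analysis: full_text, telecommuting, has_company_logo, has_questions, and the target variable fraudulent.

Three supervised machine learning algorithms are used in the project: Logistic Regression, Random Forest, and K-Nearest Neighbor (KNN).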

Data pre-process (full_text)

For this analysis, the entire full_text column is converted to a DocumentTermMatrix and then to a data frame.

# temp <- subset(model_df, fraudulent == 1)
docs <- Corpus(VectorSource(model_df$full_text))
docs <- docs %>%
  tm_map(removeNumbers) %>% # Remove numbers
  tm_map(removePunctuation) %>% # Remove punctuation
  tm_map(stripWhitespace) # Eliminate extra white spaces
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))

# Convert each full_text into a row with columns containing each term in the document and giving the frequency of unique words used in the full_text
dtm <- DocumentTermMatrix(docs)
sparse_data <- removeSparseTerms(dtm, 0.90) # Remove sparse data
# Convert to dataframe for further analysis
sparse_data_df <- as.data.frame(as.matrix(sparse_data))
final_df <- subset(sparse_data_df, select = -c(``))

# Add other variables
final_df$telecommuting <- model_df$telecommuting
final_df$has_company_logo <- model_df$has_company_logo
final_df$has_questions <- model_df$has_questions
final_df$fraudulent <- model_df$fraudulent
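
To make the document-term matrix and the sparse-term filter concrete, here is a small base R sketch on invented documents. It assumes that `removeSparseTerms(dtm, s)` keeps roughly those terms that occur in more than a fraction `1 - s` of the documents; treat that reading, and all names below, as illustrative.

```r
# Hypothetical documents
docs <- c("apply now earn money daily",
          "software engineer apply now",
          "earn money from home",
          "apply for data entry role")

# Tokenize each document and collect the vocabulary
tokens <- strsplit(docs, " +")
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term, cells hold counts
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Keep terms that appear in more than (1 - sparse) of the documents
sparse <- 0.5
doc_freq <- colSums(dtm > 0)
kept <- dtm[, doc_freq > nrow(dtm) * (1 - sparse), drop = FALSE]
colnames(kept)
```

With a threshold of 0.90, as used above, only terms present in more than about 10% of the postings would survive, which is why the final data frame has a few hundred columns rather than tens of thousands.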

View the dimension of the dataframe

dim(final_df)
## [1] 17880   313
# 17880 rows, 313 columns

Visualize data

# Histogram
par(mfrow=c(2,2))
for(i in 310:313) {
    hist(final_df[,i], main=names(final_df)[i], border="blue", col="yellow")
}

# Boxplot
par(mfrow=c(2,2))
for(i in 310:313) {
    boxplot(final_df[,i], main=names(final_df)[i], border="blue", col="yellow")
}

Correlation

A correlation matrix is created to visualize the numeric data relationship.

# Calculate the correlation between each pair of numeric variables
selected_df<-final_df[, 310:313]
corr_df <- round(cor(selected_df), 2)
corr_df
##                  telecommuting has_company_logo has_questions fraudulent
## telecommuting             1.00            -0.02          0.02       0.03
## has_company_logo         -0.02             1.00          0.23      -0.26
## has_questions             0.02             0.23          1.00      -0.09
## fraudulent                0.03            -0.26         -0.09       1.00

Visualise correlation heatmap

# Reshape the correlation matrix to long format (one row per variable pair) for plotting
melted_corr_mat<-melt(corr_df)
## Warning in melt(corr_df): The melt generic in data.table has been passed a
## matrix and will attempt to redirect to the relevant reshape2 method; please note
## that reshape2 is deprecated, and this redirection is now deprecated as well.
## To continue using melt methods from reshape2 while both libraries are attached,
## e.g. melt.list, you can prepend the namespace like reshape2::melt(corr_df). In
## the next version, this warning will become an error.
# plotting the correlation heatmap

ggplot(data = melted_corr_mat, aes(x = Var1, y = Var2, fill = value))+
  geom_tile() +
  geom_text(aes( label = value), color = "black", size = 4)

None of the features is strongly correlated with another. However, has_company_logo (-0.26) and has_questions (-0.09) are negatively correlated with fraudulent. This indicates that a job posting with a company logo or with questions is less likely to be fraudulent.
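
Because all four variables in the matrix are 0/1 indicators, the Pearson correlation reported by `cor()` is the phi coefficient of the corresponding 2x2 table. A short sketch on invented data (the vectors are hypothetical) shows the two agree:

```r
# Hypothetical binary indicators
has_company_logo <- c(1, 1, 1, 1, 0, 0, 1, 0, 1, 1)
fraudulent       <- c(0, 0, 0, 1, 1, 1, 0, 0, 0, 0)

# Pearson correlation on 0/1 variables
r <- cor(has_company_logo, fraudulent)

# Phi coefficient from the 2x2 contingency table
tab <- table(has_company_logo, fraudulent)
n00 <- tab["0", "0"]; n01 <- tab["0", "1"]
n10 <- tab["1", "0"]; n11 <- tab["1", "1"]
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

c(pearson = r, phi = as.numeric(phi))
```

A negative value means the two indicators tend not to occur together, which matches the reading that postings with a logo are less often fraudulent.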

Split data into 70% training, 30% testing

# Fix the seed so that the split into training and testing sets is reproducible
set.seed(123)
train_index <- sample(dim(final_df)[1], 0.7 * dim(final_df)[1])
model_dftrain<- final_df[train_index, ]
model_dftest <- final_df[-train_index, ]
paste("train sample size: ", dim(model_dftrain)[1])
## [1] "train sample size:  12516"
paste("test sample size: ", dim(model_dftest)[1])
## [1] "test sample size:  5364"
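
With a rare positive class, it is worth checking that a purely random split leaves a similar share of fraudulent cases in both partitions. A minimal sketch of that check on invented labels (about 5% positives, chosen only to resemble an imbalanced problem):

```r
set.seed(123)
# Hypothetical labels with 5% positives
y <- c(rep(1, 50), rep(0, 950))

# Same style of split as above: 70% of the row indices go to training
train_index <- sample(length(y), 0.7 * length(y))
y_train <- y[train_index]
y_test  <- y[-train_index]

# Share of fraudulent cases in each partition
c(train = mean(y_train), test = mean(y_test))
```

If the two shares differ noticeably, a stratified split (sampling within each class separately) is a common alternative.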

View training set

sample_n(model_dftrain, 3)
##       also amp andor around attention best big business communication company
## 17169    0   0     0      0         0    1   0        0             0       0
## 3482     0   0     0      0         0    0   0        1             0       0
## 9043     1   0     0      0         0    0   0        0             1       1
##       content currently daily drive engineering existing experience full highly
## 17169       0         0     0     0           0        0          0    0      0
## 3482        0         0     0     0           1        1          2    0      0
## 9043        0         0     0     0           0        0          0    0      0
##       hours information like long management market marketing media need new
## 17169     0           1    0    0          1      0         0     0    0   0
## 3482      0           2    1    0          0      0         0     0    0   1
## 9043      0           0    0    1          1      0         0     0    0   2
##       offer office one online people plus small social staff startup support
## 17169     0      0   0      0      0    0     0      0     0       0       0
## 3482      0      0   1      0      0    0     0      0     0       0       0
## 9043      1      1   0      0      0    0     0      0     0       0       0
##       systems talented team technology top using various website work working
## 17169       0        0    1          1   0     0       1       0    0       0
## 3482        0        0    2          3   0     1       0       0    0       1
## 9043        0        0    0          0   0     0       2       0    1       0
##       able apply based can candidates client clients communicate companies
## 17169    0     0     0   0          0      0       0           0         0
## 3482     0     0     0   0          0      0       0           0         0
## 9043     0     0     0   0          0      0       0           0         0
##       computer cost creative customer delivery effectively email environment
## 17169        0    0        0        0        0           0     0           0
## 3482         2    0        0        0        0           0     0           0
## 9043         0    0        0        1        0           1     0           0
##       every excellent fast following fulltime get global great grow growing
## 17169     0         0    0         0        1   0      0     0    0       0
## 3482      0         0    0         0        1   0      0     0    0       0
## 9043      0         0    0         0        2   0      0     0    0       1
##       growth high include including international issues key know knowledge
## 17169      0    0       0         0             0      0   0    0         0
## 3482       0    2       0         0             0      0   0    0         0
## 9043       0    0       0         1             0      0   0    0         1
##       large learn level looking making manage manager managing network
## 17169     0     0     1       0      0      0       0        0       0
## 3482      0     0     0       0      0      0       0        0       0
## 9043      0     0     0       0      0      0       0        1       0
##       opportunity part passion person phone planning platform please position
## 17169           0    0       0      0     0        0        0      0        1
## 3482            2    1       0      0     0        0        0      0        0
## 9043            0    0       0      0     0        0        0      0        0
##       process product production project projects provides quality range right
## 17169       0       1          0       0        0        0       0     0     0
## 3482        0       0          0       0        0        0       2     0     0
## 9043        0       2          0       0        0        0       0     0     0
##       role service skills software success successful system teams understand
## 17169    0       0      0        1       0          0      0     0          0
## 3482     0       0      0        3       0          0      0     1          1
## 9043     0       2      1        0       0          1      0     0          0
##       web will world across activities candidate career contract engineer
## 17169   0    1     0      0          0         0      1        0        0
## 3482    2    1     0      0          0         0      0        0        0
## 9043    1    0     0      0          0         0      0        0        0
##       ensure experienced field focus health ideal meet must needs opportunities
## 17169      1           0     0     0      0     1    0    0     0             0
## 3482       0           0     0     0      0     0    0    0     0             0
## 9043       0           0     0     0      3     0    0    0     1             0
##       per provide requirements resources seeking services solutions strong
## 17169   0       0            0         0       0        0         2      0
## 3482    0       0            0         0       0        1         5      0
## 9043    0       0            1         0       0        1         0      0
##       unique vision way ability analysis available bachelors benefits build
## 17169      0      0   0       0        0         0         1        0     0
## 3482       0      0   0       0        0         0         1        0     1
## 9043       0      1   0       1        1         0         0        1     0
##       competitive culture customers degree develop development equivalent first
## 17169           0       0         0      1       0           2          0     0
## 3482            0       0         0      2       1           3          0     0
## 9043            1       0         0      1       0           0          0     0
##       goals good help industry lead life maintain make midsenior motivated
## 17169     0    0    0        0    0    0        0    0         1         0
## 3482      0    0    0        1    0    1        0    0         0         0
## 9043      0    0    0        0    0    1        0    0         0         0
##       order organization personal problem professional providing related
## 17169     0            0        0       0            0         0       0
## 3482      0            0        0       0            0         0       0
## 9043      0            0        0       0            0         0       1
##       responsible sales strategy travel understanding value verbal within
## 17169           0     0        0      1             0     0      0      0
## 3482            0     0        0      0             0     0      0      0
## 9043            0     1        0      0             0     0      0      1
##       written year years care current deliver directly innovative interested
## 17169       0    0     0    0       0       0        0          0          0
## 3482        0    0     1    0       0       2        0          1          0
## 9043        0    1     1    2       0       0        0          0          0
##       job leadership monthly offers open operations performance positions
## 17169   0          0       0      0    0          0           0         0
## 3482    0          0       0      0    0          0           0         0
## 9043    0          0       0      1    1          0           0         0
##       potential preferred processes reports results standards time training
## 17169         0         0         1       0       0         0    1        0
## 3482          0         0         0       0       0         0    0        0
## 9043          0         0         0       0       0         0    3        0
##       well areas come design driven employees excel financial join relevant
## 17169    1     0    0      1      0         0     0         0    0        0
## 3482     0     0    0      0      0         0     0         0    0        0
## 9043     0     0    0      0      0         1     0         0    0        1
##       school senior technical we’re without brand dynamic ideas leading many
## 17169      0      1         0     0       0     0       0     0       0    0
## 3482       0      0         0     1       0     0       1     0       0    0
## 9043       0      0         0     0       0     0       0     0       0    1
##       mobile take creating flexible free just love minimum mission multiple
## 17169      0    0        0        0    0    0    0       0       0        0
## 3482       0    0        0        0    0    0    0       0       0        0
## 9043       0    0        0        1    0    0    0       0       0        1
##       passionate play record required use want applications associate change
## 17169          0    0      0        0   0    0            0         0      0
## 3482           0    1      0        0   0    0            3         0      0
## 9043           1    0      0        0   0    0            0         1      0
##       tools background delivering duties entry improve months reporting tasks
## 17169     0          0          0      0     0       1      0         0     0
## 3482      0          0          0      0     0       0      0         0     0
## 9043      0          0          0      1     1       0      0         0     1
##       agency building data developer developing digital internal learning
## 17169      0        0    0         2          0       0        0        0
## 3482       0        0    0         1          2       0        0        0
## 9043       0        0    1         0          0       0        0        0
##       products technologies closely employee internet start track application
## 17169        0            0       0        0        0     0     0           0
## 3482         1            0       0        0        0     0     0           2
## 9043         2            0       0        0        0     0     0           0
##       create established may user hard insurance believe now plan problems
## 17169      0           0   0    0    1         0       0   0    0        0
## 3482       1           0   0    0    0         0       0   0    0        0
## 9043       0           2   0    0    0         2       0   1    1        0
##       complex day education individuals relationships jobs fun see english
## 17169       0   0         0           0             0    0   0   0       0
## 3482        0   0         0           0             0    0   0   0       0
## 9043        0   0         2           0             0    0   0   0       0
##       individual salary dental group package paid medical exciting members
## 17169          0      0      0     0       0    0       0        0       1
## 3482           0      0      0     0       0    0       0        0       0
## 9043           0      0      1     0       1    1      12        0       1
##       least telecommuting has_company_logo has_questions fraudulent
## 17169     0             0                1             1          0
## 3482      1             0                1             1          0
## 9043      0             0                1             0          0

Convert the dependent variable to a factor

model_dftrain$fraudulent = as.factor(model_dftrain$fraudulent)
model_dftest$fraudulent = as.factor(model_dftest$fraudulent)

Logistic Regression

# Train logistic regression
lr_model <- glm(formula = fraudulent ~ ., family = "binomial", data = model_dftrain)

Predict the testing set

# Predicted probability of fraud for each test posting
lr_pred_test <- predict(lr_model, newdata = model_dftest, type = "response")
test <- model_dftest
# Classify as fraudulent ("1") when the predicted probability exceeds 0.5
test$pred_glm = as.factor(ifelse(lr_pred_test > 0.5, "1", "0"))

Calculate AUC of the model

calcAUC <- function(predcol, outcol) {
  perf <- performance(prediction(as.numeric(predcol), outcol == 1), "auc")
  as.numeric(perf@y.values)
}

paste("AUC of Logistic Regression is", round(calcAUC(lr_pred_test, model_dftest$fraudulent), digits=4))
## [1] "AUC of Logistic Regression is 0.953"

Random Forest

# Train random forest
trcontrol <- trainControl(method = "repeatedcv", number = 2, repeats = 1, search = "random", verboseIter = TRUE)
grid <- data.frame(mtry = c(100))
rf_model <- train(fraudulent ~ ., method = "rf", data = model_dftrain, ntree = 200, trControl = trcontrol, tuneGrid = grid)
## + Fold1.Rep1: mtry=100 
## - Fold1.Rep1: mtry=100 
## + Fold2.Rep1: mtry=100 
## - Fold2.Rep1: mtry=100 
## Aggregating results
## Fitting final model on full training set
rf_model
## Random Forest 
## 
## 12516 samples
##   312 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 1 times) 
## Summary of sample sizes: 6258, 6258 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9691595  0.5253441
## 
## Tuning parameter 'mtry' was held constant at a value of 100

Predict the testing set

rf_pred_test <- predict(rf_model, newdata = model_dftest)

Calculate AUC of the model

paste("AUC of Random Forest is", round(calcAUC(rf_pred_test, model_dftest$fraudulent), digits=4))
## [1] "AUC of Random Forest is 0.8028"

K-Nearest Neighbor (KNN)

# Fit KNN on the training set and classify the testing set in a single call (k = 25)
knn <- kknn(fraudulent ~ ., model_dftrain, model_dftest, k = 25)
# View(knn)

Predict the testing set

knn_pred_test <- predict(knn, newdata = model_dftest)

Calculate AUC of the model

paste("AUC of KNN is", round(calcAUC(knn_pred_test, model_dftest$fraudulent), digits=4))
## [1] "AUC of KNN is 0.767"
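
The three AUC values above are not computed from the same kind of input: `lr_pred_test` holds predicted probabilities, while `rf_pred_test` and `knn_pred_test` hold predicted class labels that `calcAUC` converts to numbers. The sketch below uses a hypothetical helper, `auc_rank`, and invented scores to show how much information is lost when AUC is computed from hard labels.

```r
# AUC as the probability that a random positive is scored above a random negative
# (ties count as one half), computed with the rank-sum formula
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks handle ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical labels and predicted probabilities
label <- c(0, 0, 0, 0, 1, 1)
prob  <- c(0.1, 0.2, 0.6, 0.3, 0.4, 0.9)

auc_prob  <- auc_rank(prob, label)                      # from probabilities: 0.875
auc_class <- auc_rank(as.numeric(prob > 0.5), label)    # from hard 0/1 labels: 0.625
```

For a like-for-like comparison, each model's AUC would need to be computed from its predicted class probabilities rather than from its predicted classes.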

Evaluation

Accuracy and area under the curve (AUC) are used to evaluate how well the models classify real and fake job postings. However, the dataset is highly imbalanced: in the testing set only 272 of the 5,364 postings (about 5%) are fraudulent, so a model that labels every posting as real would still reach about 95% accuracy. It is therefore necessary to also use precision, recall, and F1 score to evaluate each model’s ability to identify fake job postings.
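
As a worked illustration of these metrics, the sketch below computes them from a confusion matrix built with `table(predicted, actual)`, where rows are predictions and columns are actual classes. The ten labels are invented for the example.

```r
# Hypothetical predictions and labels; "1" is the positive (fraud) class
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are actual classes
cm <- table(predicted, actual)

TP <- cm["1", "1"]  # predicted fraud, actually fraud
FP <- cm["1", "0"]  # predicted fraud, actually real
FN <- cm["0", "1"]  # predicted real, actually fraud
TN <- cm["0", "0"]  # predicted real, actually real

precision <- TP / (TP + FP)   # how many flagged postings are really fraud
recall    <- TP / (TP + FN)   # how many fraud postings are caught
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (TP + TN) / sum(cm)
```

Indexing the table by class name rather than by position makes it harder to mix up false positives and false negatives.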

# Error Metrics -- Confusion Matrix
# CM is built as table(predicted, actual): rows are predictions, columns are actual classes
err_metric=function(CM)
{
  TN = CM[1,1]  # predicted 0, actual 0
  TP = CM[2,2]  # predicted 1, actual 1
  FP = CM[2,1]  # predicted 1, actual 0
  FN = CM[1,2]  # predicted 0, actual 1
  precision = (TP)/(TP+FP)
  recall_score = (TP)/(TP+FN)
  
  f1_score = 2*((precision*recall_score)/(precision+recall_score))
  accuracy_model = (TP+TN)/(TP+TN+FP+FN)
  False_positive_rate = (FP)/(FP+TN)
  False_negative_rate = (FN)/(FN+TP)
  
  print(paste("Precision value of the model: ", round(precision,2)))
  print(paste("Accuracy of the model: ", round(accuracy_model,2)))
  print(paste("Recall value of the model: ", round(recall_score,2)))
  print(paste("False Positive rate of the model: ", round(False_positive_rate,2)))
  print(paste("False Negative rate of the model: ", round(False_negative_rate,2)))
  print(paste("f1 score of the model: ", round(f1_score,2)))
}

Confusion Matrix and Error Metrics of Logistic Regression

confMatrix_lr = table(test$pred_glm, test$fraudulent)
print(confMatrix_lr)
##    
##        0    1
##   0 5025  114
##   1   67  158
err_metric(confMatrix_lr)
## [1] "Precision value of the model:  0.7"
## [1] "Accuracy of the model:  0.97"
## [1] "Recall value of the model:  0.58"
## [1] "False Positive rate of the model:  0.01"
## [1] "False Negative rate of the model:  0.42"
## [1] "f1 score of the model:  0.64"

Confusion Matrix and Error Metrics of Random Forest

confMatrix_rf = table(rf_pred_test, model_dftest$fraudulent)
print(confMatrix_rf)
##             
## rf_pred_test    0    1
##            0 5087  107
##            1    5  165
err_metric(confMatrix_rf)
## [1] "Precision value of the model:  0.97"
## [1] "Accuracy of the model:  0.98"
## [1] "Recall value of the model:  0.61"
## [1] "False Positive rate of the model:  0"
## [1] "False Negative rate of the model:  0.39"
## [1] "f1 score of the model:  0.75"

Confusion Matrix and Error Metrics of KNN

confMatrix_knn = table(knn_pred_test, model_dftest$fraudulent)
print(confMatrix_knn)
##              
## knn_pred_test    0    1
##             0 5078  126
##             1   14  146
err_metric(confMatrix_knn)
## [1] "Precision value of the model:  0.91"
## [1] "Accuracy of the model:  0.97"
## [1] "Recall value of the model:  0.54"
## [1] "False Positive rate of the model:  0"
## [1] "False Negative rate of the model:  0.46"
## [1] "f1 score of the model:  0.68"

Summary of Results

Metric     Logistic Regression  Random Forest  KNN
Accuracy   0.97                 0.98           0.97
Precision  0.70                 0.97           0.91
Recall     0.58                 0.61           0.54
F1         0.64                 0.75           0.68
AUC        0.95                 0.80           0.77

Random Forest achieves the best accuracy, precision, recall, and F1 score. Logistic Regression has the highest AUC, but its AUC is computed from predicted probabilities, whereas the AUC values for Random Forest and KNN are computed from predicted class labels, so the three AUC values are not directly comparable. Given its precision and F1 score, we conclude that Random Forest is the best of the three models at classifying real and fake job postings.

Result Analysis Summary

  1. What are the key features/characteristics of fraudulent job postings?

Based on the correlation analysis, none of the features is strongly correlated with our target feature (fraudulent), so it is difficult to single out the key features or characteristics of fraudulent job postings. However, has_company_logo and has_questions are negatively correlated with fraudulent. This may indicate that a job posting with a company logo or with questions is less likely to be fraudulent.

  2. Which classification model is the best to determine whether the job is real or not?

Random Forest is the best classification model to determine whether the job is real or not. This conclusion is based on the Random Forest model showing the best accuracy and precision among the models compared.

  3. Other findings

Limitation and Improvement

The dataset is highly imbalanced: most of the job postings are legitimate and only a few are fraudulent. As a result, real jobs are identified quite well, while fraudulent jobs are harder to detect. Techniques for handling imbalanced data, such as SMOTE, can be applied to make a fairer comparison between real and fraudulent jobs. In addition, other NLP techniques such as a TF-IDF vectorizer can be tried to find a better numerical representation of the text for running ML models.
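
To illustrate the TF-IDF idea mentioned above, here is a base R sketch on a small invented count matrix. It uses one common variant (term frequency normalized by document length, natural-log inverse document frequency); library implementations may use different logarithms or normalization, so treat the exact formula as an assumption.

```r
# Hypothetical document-term count matrix (rows = documents, columns = terms)
counts <- matrix(c(2, 1, 0,
                   1, 0, 1,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("work", "earn", "engineer")))

# Term frequency: counts normalized by document length
tf <- counts / rowSums(counts)

# Inverse document frequency: log(N / number of documents containing the term)
idf <- log(nrow(counts) / colSums(counts > 0))

# Multiply each column of tf by the matching idf value
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

A term such as "work" that appears in every document gets a weight of zero, while rarer terms are weighted up, which is the property that raw counts lack.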

Conclusion

In most instances, if something appears too good to be true, it probably is. Most fraudulent job descriptions and requirements are vague and too good to be true, such as easy work for unrealistic pay. Be wary of part-time and entry-level jobs that require minimal qualifications and little experience, such as data entry and administrative roles. Home-based jobs and job listings without a company logo are also warning signs. In terms of classification models, Random Forest gives the best accuracy and precision; however, better results could be achieved with a more balanced dataset containing sufficient examples of both real and fake job postings. Finally, with a little research, we can not only find out whether a company and a job are legitimate, but also discover whether the company is the right fit.